{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/apache/beam/blob/master//Users/dcavazos/src/beam/examples/notebooks/documentation/transforms/python/elementwise/regex-py.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open in Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "view-the-docs-top"
   },
   "source": [
    "<table align=\"left\"><td><a target=\"_blank\" href=\"https://beam.apache.org/documentation/transforms/python/elementwise/regex\"><img src=\"https://beam.apache.org/images/logos/full-color/name-bottom/beam-logo-full-color-name-bottom-100.png\" width=\"32\" height=\"32\" />View the docs</a></td></table>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "id": "_-code"
   },
   "outputs": [],
   "source": [
    "#@title Licensed under the Apache License, Version 2.0 (the \"License\")\n",
    "# Licensed to the Apache Software Foundation (ASF) under one\n",
    "# or more contributor license agreements. See the NOTICE file\n",
    "# distributed with this work for additional information\n",
    "# regarding copyright ownership. The ASF licenses this file\n",
    "# to you under the Apache License, Version 2.0 (the\n",
    "# \"License\"); you may not use this file except in compliance\n",
    "# with the License. You may obtain a copy of the License at\n",
    "#\n",
    "#   http://www.apache.org/licenses/LICENSE-2.0\n",
    "#\n",
    "# Unless required by applicable law or agreed to in writing,\n",
    "# software distributed under the License is distributed on an\n",
    "# \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n",
    "# KIND, either express or implied. See the License for the\n",
    "# specific language governing permissions and limitations\n",
    "# under the License."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "regex"
   },
   "source": [
    "# Regex\n",
    "\n",
    "<script type=\"text/javascript\">\n",
    "localStorage.setItem('language', 'language-py')\n",
    "</script>\n",
    "\n",
    "<table align=\"left\" style=\"margin-right:1em\">\n",
    "  <td>\n",
    "    <a class=\"button\" target=\"_blank\" href=\"https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html#apache_beam.transforms.util.Regex\"><img src=\"https://beam.apache.org/images/logos/sdks/python.png\" width=\"32px\" height=\"32px\" alt=\"Pydoc\"/> Pydoc</a>\n",
    "  </td>\n",
    "</table>\n",
    "\n",
    "<br/><br/><br/>\n",
    "\n",
    "Filters input string elements based on a regex. May also transform them based on the matching groups."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "setup"
   },
   "source": [
    "## Setup\n",
    "\n",
    "To run a code cell, you can click the **Run cell** button at the top left of the cell,\n",
    "or select it and press **`Shift+Enter`**.\n",
    "Try modifying a code cell and re-running it to see what happens.\n",
    "\n",
    "> To learn more about Colab, see\n",
    "> [Welcome to Colaboratory!](https://colab.sandbox.google.com/notebooks/welcome.ipynb).\n",
    "\n",
    "First, let's install the `apache-beam` module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "setup-code"
   },
   "outputs": [],
   "source": [
    "!pip install --quiet -U apache-beam"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "examples"
   },
   "source": [
    "## Examples\n",
    "\n",
    "In the following examples, we create a pipeline with a `PCollection` of text strings.\n",
    "Then, we use the `Regex` transform to search, replace, and split through the text elements using\n",
    "[regular expressions](https://docs.python.org/3/library/re.html).\n",
    "\n",
    "You can use tools to help you create and test your regular expressions, such as\n",
    "[regex101](https://regex101.com/).\n",
    "Make sure to specify the Python flavor at the left side bar.\n",
    "\n",
    "Lets look at the\n",
    "[regular expression `(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)`](https://regex101.com/r/Z7hTTj/3)\n",
    "for example.\n",
    "It matches anything that is not a whitespace `\\s` (`[ \\t\\n\\r\\f\\v]`) or comma `,`\n",
    "until a comma is found and stores that in the named group `icon`,\n",
    "this can match even `utf-8` strings.\n",
    "Then it matches any number of whitespaces, followed by at least one word character\n",
    "`\\w` (`[a-zA-Z0-9_]`), which is stored in the second group for the *name*.\n",
    "It does the same with the third group for the *duration*.\n",
    "\n",
    "> *Note:* To avoid unexpected string escaping in your regular expressions,\n",
    "> it is recommended to use\n",
    "> [raw strings](https://docs.python.org/3/reference/lexical_analysis.html?highlight=raw#string-and-bytes-literals)\n",
    "> such as `r'raw-string'` instead of `'escaped-string'`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-1-regex-match"
   },
   "source": [
    "### Example 1: Regex match\n",
    "\n",
    "`Regex.matches` keeps only the elements that match the regular expression,\n",
    "returning the matched group.\n",
    "The argument `group` is set to `0` (the entire match) by default,\n",
    "but can be set to a group number like `3`, or to a named group like `'icon'`.\n",
    "\n",
    "`Regex.matches` starts to match the regular expression at the beginning of the string.\n",
    "To match until the end of the string, add `'$'` at the end of the regular expression.\n",
    "\n",
    "To start matching at any point instead of the beginning of the string, use\n",
    "[`Regex.find(regex)`](#example-4-regex-find)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "example-1-regex-match-code"
   },
   "outputs": [],
   "source": [
    "import apache_beam as beam\n",
    "\n",
    "# Matches a named group 'icon', and then two comma-separated groups.\n",
    "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n",
    "with beam.Pipeline() as pipeline:\n",
    "  plants_matches = (\n",
    "      pipeline\n",
    "      | 'Garden plants' >> beam.Create([\n",
    "          '🍓, Strawberry, perennial',\n",
    "          '🥕, Carrot, biennial ignoring trailing words',\n",
    "          '🍆, Eggplant, perennial',\n",
    "          '🍅, Tomato, annual',\n",
    "          '🥔, Potato, perennial',\n",
    "          '# 🍌, invalid, format',\n",
    "          'invalid, 🍉, format',\n",
    "      ])\n",
    "      | 'Parse plants' >> beam.Regex.matches(regex)\n",
    "      | beam.Map(print)\n",
    "  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-1-regex-match-2"
   },
   "source": [
    "<table align=\"left\" style=\"margin-right:1em\">\n",
    "  <td>\n",
    "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
    "  </td>\n",
    "</table>\n",
    "\n",
    "<br/><br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-2-regex-match-with-all-groups"
   },
   "source": [
    "### Example 2: Regex match with all groups\n",
    "\n",
    "`Regex.all_matches` keeps only the elements that match the regular expression,\n",
    "returning *all groups* as a list.\n",
    "The groups are returned in the order encountered in the regular expression,\n",
    "including `group 0` (the entire match) as the first group.\n",
    "\n",
    "`Regex.all_matches` starts to match the regular expression at the beginning of the string.\n",
    "To match until the end of the string, add `'$'` at the end of the regular expression.\n",
    "\n",
    "To start matching at any point instead of the beginning of the string, use\n",
    "[`Regex.find_all(regex, group=Regex.ALL, outputEmpty=False)`](#example-5-regex-find-all)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "example-2-regex-match-with-all-groups-code"
   },
   "outputs": [],
   "source": [
    "import apache_beam as beam\n",
    "\n",
    "# Matches a named group 'icon', and then two comma-separated groups.\n",
    "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n",
    "with beam.Pipeline() as pipeline:\n",
    "  plants_all_matches = (\n",
    "      pipeline\n",
    "      | 'Garden plants' >> beam.Create([\n",
    "          '🍓, Strawberry, perennial',\n",
    "          '🥕, Carrot, biennial ignoring trailing words',\n",
    "          '🍆, Eggplant, perennial',\n",
    "          '🍅, Tomato, annual',\n",
    "          '🥔, Potato, perennial',\n",
    "          '# 🍌, invalid, format',\n",
    "          'invalid, 🍉, format',\n",
    "      ])\n",
    "      | 'Parse plants' >> beam.Regex.all_matches(regex)\n",
    "      | beam.Map(print)\n",
    "  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-2-regex-match-with-all-groups-2"
   },
   "source": [
    "<table align=\"left\" style=\"margin-right:1em\">\n",
    "  <td>\n",
    "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
    "  </td>\n",
    "</table>\n",
    "\n",
    "<br/><br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-3-regex-match-into-key-value-pairs"
   },
   "source": [
    "### Example 3: Regex match into key-value pairs\n",
    "\n",
    "`Regex.matches_kv` keeps only the elements that match the regular expression,\n",
    "returning a key-value pair using the specified groups.\n",
    "The argument `keyGroup` is set to a group number like `3`, or to a named group like `'icon'`.\n",
    "The argument `valueGroup` is set to `0` (the entire match) by default,\n",
    "but can be set to a group number like `3`, or to a named group like `'icon'`.\n",
    "\n",
    "`Regex.matches_kv` starts to match the regular expression at the beginning of the string.\n",
    "To match until the end of the string, add `'$'` at the end of the regular expression.\n",
    "\n",
    "To start matching at any point instead of the beginning of the string, use\n",
    "[`Regex.find_kv(regex, keyGroup)`](#example-6-regex-find-as-key-value-pairs)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "example-3-regex-match-into-key-value-pairs-code"
   },
   "outputs": [],
   "source": [
    "import apache_beam as beam\n",
    "\n",
    "# Matches a named group 'icon', and then two comma-separated groups.\n",
    "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n",
    "with beam.Pipeline() as pipeline:\n",
    "  plants_matches_kv = (\n",
    "      pipeline\n",
    "      | 'Garden plants' >> beam.Create([\n",
    "          '🍓, Strawberry, perennial',\n",
    "          '🥕, Carrot, biennial ignoring trailing words',\n",
    "          '🍆, Eggplant, perennial',\n",
    "          '🍅, Tomato, annual',\n",
    "          '🥔, Potato, perennial',\n",
    "          '# 🍌, invalid, format',\n",
    "          'invalid, 🍉, format',\n",
    "      ])\n",
    "      | 'Parse plants' >> beam.Regex.matches_kv(regex, keyGroup='icon')\n",
    "      | beam.Map(print)\n",
    "  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-3-regex-match-into-key-value-pairs-2"
   },
   "source": [
    "<table align=\"left\" style=\"margin-right:1em\">\n",
    "  <td>\n",
    "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
    "  </td>\n",
    "</table>\n",
    "\n",
    "<br/><br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-4-regex-find"
   },
   "source": [
    "### Example 4: Regex find\n",
    "\n",
    "`Regex.find` keeps only the elements that match the regular expression,\n",
    "returning the matched group.\n",
    "The argument `group` is set to `0` (the entire match) by default,\n",
    "but can be set to a group number like `3`, or to a named group like `'icon'`.\n",
    "\n",
    "`Regex.find` matches the first occurrence of the regular expression in the string.\n",
    "To start matching at the beginning, add `'^'` at the beginning of the regular expression.\n",
    "To match until the end of the string, add `'$'` at the end of the regular expression.\n",
    "\n",
    "If you need to match from the start only, consider using\n",
    "[`Regex.matches(regex)`](#example-1-regex-match)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "example-4-regex-find-code"
   },
   "outputs": [],
   "source": [
    "import apache_beam as beam\n",
    "\n",
    "# Matches a named group 'icon', and then two comma-separated groups.\n",
    "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n",
    "with beam.Pipeline() as pipeline:\n",
    "  plants_matches = (\n",
    "      pipeline\n",
    "      | 'Garden plants' >> beam.Create([\n",
    "          '# 🍓, Strawberry, perennial',\n",
    "          '# 🥕, Carrot, biennial ignoring trailing words',\n",
    "          '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',\n",
    "          '# 🍅, Tomato, annual - 🍉, Watermelon, annual',\n",
    "          '# 🥔, Potato, perennial',\n",
    "      ])\n",
    "      | 'Parse plants' >> beam.Regex.find(regex)\n",
    "      | beam.Map(print)\n",
    "  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-4-regex-find-2"
   },
   "source": [
    "<table align=\"left\" style=\"margin-right:1em\">\n",
    "  <td>\n",
    "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
    "  </td>\n",
    "</table>\n",
    "\n",
    "<br/><br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-5-regex-find-all"
   },
   "source": [
    "### Example 5: Regex find all\n",
    "\n",
    "`Regex.find_all` returns a list of all the matches of the regular expression,\n",
    "returning the matched group.\n",
    "The argument `group` is set to `0` by default, but can be set to a group number like `3`, to a named group like `'icon'`, or to `Regex.ALL` to return all groups.\n",
    "The argument `outputEmpty` is set to `True` by default, but can be set to `False` to skip elements where no matches were found.\n",
    "\n",
    "`Regex.find_all` matches the regular expression anywhere it is found in the string.\n",
    "To start matching at the beginning, add `'^'` at the start of the regular expression.\n",
    "To match until the end of the string, add `'$'` at the end of the regular expression.\n",
    "\n",
    "If you need to match all groups from the start only, consider using\n",
    "[`Regex.all_matches(regex)`](#example-2-regex-match-with-all-groups)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "example-5-regex-find-all-code"
   },
   "outputs": [],
   "source": [
    "import apache_beam as beam\n",
    "\n",
    "# Matches a named group 'icon', and then two comma-separated groups.\n",
    "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n",
    "with beam.Pipeline() as pipeline:\n",
    "  plants_find_all = (\n",
    "      pipeline\n",
    "      | 'Garden plants' >> beam.Create([\n",
    "          '# 🍓, Strawberry, perennial',\n",
    "          '# 🥕, Carrot, biennial ignoring trailing words',\n",
    "          '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',\n",
    "          '# 🍅, Tomato, annual - 🍉, Watermelon, annual',\n",
    "          '# 🥔, Potato, perennial',\n",
    "      ])\n",
    "      | 'Parse plants' >> beam.Regex.find_all(regex)\n",
    "      | beam.Map(print)\n",
    "  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-5-regex-find-all-2"
   },
   "source": [
    "<table align=\"left\" style=\"margin-right:1em\">\n",
    "  <td>\n",
    "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
    "  </td>\n",
    "</table>\n",
    "\n",
    "<br/><br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-6-regex-find-as-key-value-pairs"
   },
   "source": [
    "### Example 6: Regex find as key-value pairs\n",
    "\n",
    "`Regex.find_kv` returns a list of all the matches of the regular expression,\n",
    "returning a key-value pair using the specified groups.\n",
    "The argument `keyGroup` is set to a group number like `3`, or to a named group like `'icon'`.\n",
    "The argument `valueGroup` is set to `0` (the entire match) by default,\n",
    "but can be set to a group number like `3`, or to a named group like `'icon'`.\n",
    "\n",
    "`Regex.find_kv` matches the first occurrence of the regular expression in the string.\n",
    "To start matching at the beginning, add `'^'` at the beginning of the regular expression.\n",
    "To match until the end of the string, add `'$'` at the end of the regular expression.\n",
    "\n",
    "If you need to match as key-value pairs from the start only, consider using\n",
    "[`Regex.matches_kv(regex)`](#example-3-regex-match-into-key-value-pairs)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "example-6-regex-find-as-key-value-pairs-code"
   },
   "outputs": [],
   "source": [
    "import apache_beam as beam\n",
    "\n",
    "# Matches a named group 'icon', and then two comma-separated groups.\n",
    "regex = r'(?P<icon>[^\\s,]+), *(\\w+), *(\\w+)'\n",
    "with beam.Pipeline() as pipeline:\n",
    "  plants_matches_kv = (\n",
    "      pipeline\n",
    "      | 'Garden plants' >> beam.Create([\n",
    "          '# 🍓, Strawberry, perennial',\n",
    "          '# 🥕, Carrot, biennial ignoring trailing words',\n",
    "          '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',\n",
    "          '# 🍅, Tomato, annual - 🍉, Watermelon, annual',\n",
    "          '# 🥔, Potato, perennial',\n",
    "      ])\n",
    "      | 'Parse plants' >> beam.Regex.find_kv(regex, keyGroup='icon')\n",
    "      | beam.Map(print)\n",
    "  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-6-regex-find-as-key-value-pairs-2"
   },
   "source": [
    "<table align=\"left\" style=\"margin-right:1em\">\n",
    "  <td>\n",
    "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
    "  </td>\n",
    "</table>\n",
    "\n",
    "<br/><br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-7-regex-replace-all"
   },
   "source": [
    "### Example 7: Regex replace all\n",
    "\n",
    "`Regex.replace_all` returns the string with all the occurrences of the regular expression replaced by another string.\n",
    "You can also use\n",
    "[backreferences](https://docs.python.org/3/library/re.html?highlight=backreference#re.sub)\n",
    "on the `replacement`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "example-7-regex-replace-all-code"
   },
   "outputs": [],
   "source": [
    "import apache_beam as beam\n",
    "\n",
    "with beam.Pipeline() as pipeline:\n",
    "  plants_replace_all = (\n",
    "      pipeline\n",
    "      | 'Garden plants' >> beam.Create([\n",
    "          '🍓 : Strawberry : perennial',\n",
    "          '🥕 : Carrot : biennial',\n",
    "          '🍆\\t:\\tEggplant\\t:\\tperennial',\n",
    "          '🍅 : Tomato : annual',\n",
    "          '🥔 : Potato : perennial',\n",
    "      ])\n",
    "      | 'To CSV' >> beam.Regex.replace_all(r'\\s*:\\s*', ',')\n",
    "      | beam.Map(print)\n",
    "  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-7-regex-replace-all-2"
   },
   "source": [
    "<table align=\"left\" style=\"margin-right:1em\">\n",
    "  <td>\n",
    "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
    "  </td>\n",
    "</table>\n",
    "\n",
    "<br/><br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-8-regex-replace-first"
   },
   "source": [
    "### Example 8: Regex replace first\n",
    "\n",
    "`Regex.replace_first` returns the string with the first occurrence of the regular expression replaced by another string.\n",
    "You can also use\n",
    "[backreferences](https://docs.python.org/3/library/re.html?highlight=backreference#re.sub)\n",
    "on the `replacement`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "example-8-regex-replace-first-code"
   },
   "outputs": [],
   "source": [
    "import apache_beam as beam\n",
    "\n",
    "with beam.Pipeline() as pipeline:\n",
    "  plants_replace_first = (\n",
    "      pipeline\n",
    "      | 'Garden plants' >> beam.Create([\n",
    "          '🍓, Strawberry, perennial',\n",
    "          '🥕, Carrot, biennial',\n",
    "          '🍆,\\tEggplant, perennial',\n",
    "          '🍅, Tomato, annual',\n",
    "          '🥔, Potato, perennial',\n",
    "      ])\n",
    "      | 'As dictionary' >> beam.Regex.replace_first(r'\\s*,\\s*', ': ')\n",
    "      | beam.Map(print)\n",
    "  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-8-regex-replace-first-2"
   },
   "source": [
    "<table align=\"left\" style=\"margin-right:1em\">\n",
    "  <td>\n",
    "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
    "  </td>\n",
    "</table>\n",
    "\n",
    "<br/><br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-9-regex-split"
   },
   "source": [
    "### Example 9: Regex split\n",
    "\n",
    "`Regex.split` returns the list of strings that were delimited by the specified regular expression.\n",
    "The argument `outputEmpty` is set to `False` by default, but can be set to `True` to keep empty items in the output list."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "id": "example-9-regex-split-code"
   },
   "outputs": [],
   "source": [
    "import apache_beam as beam\n",
    "\n",
    "with beam.Pipeline() as pipeline:\n",
    "  plants_split = (\n",
    "      pipeline\n",
    "      | 'Garden plants' >> beam.Create([\n",
    "          '🍓 : Strawberry : perennial',\n",
    "          '🥕 : Carrot : biennial',\n",
    "          '🍆\\t:\\tEggplant : perennial',\n",
    "          '🍅 : Tomato : annual',\n",
    "          '🥔 : Potato : perennial',\n",
    "      ])\n",
    "      | 'Parse plants' >> beam.Regex.split(r'\\s*:\\s*')\n",
    "      | beam.Map(print)\n",
    "  )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "example-9-regex-split-2"
   },
   "source": [
    "<table align=\"left\" style=\"margin-right:1em\">\n",
    "  <td>\n",
    "    <a class=\"button\" target=\"_blank\" href=\"https://github.com/apache/beam/blob/master/sdks/python/apache_beam/examples/snippets/transforms/elementwise/regex.py\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" width=\"32px\" height=\"32px\" alt=\"View source code\"/> View source code</a>\n",
    "  </td>\n",
    "</table>\n",
    "\n",
    "<br/><br/><br/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "related-transforms"
   },
   "source": [
    "## Related transforms\n",
    "\n",
    "* [FlatMap](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap) behaves the same as `Map`, but for\n",
    "  each input it may produce zero or more outputs.\n",
    "* [Map](https://beam.apache.org/documentation/transforms/python/elementwise/map) applies a simple 1-to-1 mapping function over each element in the collection\n",
    "\n",
    "<table align=\"left\" style=\"margin-right:1em\">\n",
    "  <td>\n",
    "    <a class=\"button\" target=\"_blank\" href=\"https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.util.html#apache_beam.transforms.util.Regex\"><img src=\"https://beam.apache.org/images/logos/sdks/python.png\" width=\"32px\" height=\"32px\" alt=\"Pydoc\"/> Pydoc</a>\n",
    "  </td>\n",
    "</table>\n",
    "\n",
    "<br/><br/><br/></icon>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "view-the-docs-bottom"
   },
   "source": [
    "<table align=\"left\"><td><a target=\"_blank\" href=\"https://beam.apache.org/documentation/transforms/python/elementwise/regex\"><img src=\"https://beam.apache.org/images/logos/full-color/name-bottom/beam-logo-full-color-name-bottom-100.png\" width=\"32\" height=\"32\" />View the docs</a></td></table>"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "name": "Regex - element-wise transform",
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "python3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
